Mining Frequent Itemsets from Secondary Memory 



Gosta Grahne and Jianfei Zhu 
Concordia University 
Montreal, Canada 
{grahne, j_zhu}@cs.concordia.ca 

March 6, 2004 



Abstract 

Mining frequent itemsets is at the core of mining association rules, and is by now 
quite well understood algorithmically. However, most algorithms for mining frequent 
itemsets assume that the main memory is large enough for the data structures used 
in the mining, and very few efficient algorithms deal with the case when the database 
is very large or the minimum support is very low. Mining frequent itemsets from 
a very large database poses new challenges, as astronomical amounts of raw data are
ubiquitously being recorded in commerce, science and government.

In this paper, we discuss approaches to mining frequent itemsets when data struc- 
tures are too large to fit in main memory. Several divide-and-conquer algorithms are 
given for mining from disks. Many novel techniques are introduced. Experimental re- 
sults show that the techniques reduce the required disk accesses by orders of magnitude, 
and enable truly scalable data mining. 

1 Introduction 

Mining frequent itemsets is a fundamental problem for mining association rules. It also
plays an important role in many other data mining tasks such as sequential patterns,
episodes, multi-dimensional patterns and so on. In addition, frequent itemsets are one of
the key abstractions in data mining.

The description of the problem is as follows. Let $I = \{i_1, i_2, \ldots, i_j, \ldots, i_n\}$ be a set of
items. Items will sometimes also be denoted $a, b, c, \ldots$. An $I$-transaction $\tau$ is a subset of $I$.
An $I$-transactional database $\mathcal{D}$ is a finite bag of $I$-transactions. The support of an itemset
$S \subseteq I$ is the proportion of transactions in $\mathcal{D}$ that contain $S$. The task of mining frequent
itemsets is to find all $S$ such that the support of $S$ is greater than some given minimum
support $\xi$, where $\xi$ either is a fraction in $[0, 1]$, or an absolute count.
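The definition can be made concrete with a small brute-force sketch. It is illustrative only: the function names `support` and `frequent_itemsets` are hypothetical, the enumeration is exponential in the number of items, and the threshold is treated as inclusive (support at least the minimum), which is an assumption chosen to match the 50% example used later in this paper.

```python
from itertools import combinations

def support(itemset, database):
    """Fraction of transactions in `database` that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    return sum(1 for t in database if itemset <= t) / len(database)

def frequent_itemsets(database, min_support):
    """Brute force: test every non-empty subset of the item universe."""
    items = sorted(set().union(*database))
    result = {}
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            s = support(candidate, database)
            if s >= min_support:  # assumption: the threshold is inclusive
                result[frozenset(candidate)] = s
    return result

# A bag of transactions is modelled as a list, so duplicates are kept.
db = [frozenset("abd"), frozenset("bcd"), frozenset("ac"), frozenset("ab")]
freq = frequent_itemsets(db, 0.5)
```

Such an enumeration is only feasible for toy inputs; the algorithms discussed in this paper exist precisely because it does not scale.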

Most of the algorithms, such as Apriori, DepthProject, and dEclat, work well
when the main memory is big enough to fit the whole database and/or the data structures
(candidate sets, FP-trees, etc.). When a database is very large or when the minimum
support is very low, either the data structures used by the algorithms may not be
accommodated in main memory, or the algorithms spend too much time on multiple passes over
the database. In the First IEEE ICDM Workshop on Frequent Itemset Mining Implementations,
FIMI '03 [18], many well-known algorithms were implemented and independently
tested. The results show that "none of the algorithms is able to gracefully scale-up to very
large datasets, with millions of transactions" [19].

At the same time, very large databases do exist in real life. In a medium-sized business, or
in a company as big as Walmart, it is very easy to collect a few gigabytes of data. Terabytes
of raw data are ubiquitously being recorded in commerce, science and government. The
question of how to handle these databases is still one of the most difficult problems in data
mining.

A few researchers have tried to mine frequent itemsets from very large databases. One
approach is sampling: a random sample of the database is picked, all frequent itemsets
are found in the sample, and the results are then verified against the rest of the
database. This approach needs only one pass over the database. However, the results are
probabilistic, meaning that some critical frequent itemsets could be missing.

Partitioning is another approach for mining very large databases. This approach
first partitions the database into many small databases, and mines candidate frequent
itemsets from each small database. One more pass over the original database is then done
to verify the candidate frequent itemsets. The approach thus needs only two database
scans. However, when the data structures used for storing candidate frequent itemsets are
too big to fit in main memory, a significant number of disk I/O's is needed for the
disk-resident data structures.

Han et al. introduce the FP-growth method, which uses two database scans
for constructing an FP-tree from the database, and then mines all frequent itemsets from 
the FP-tree. Two approaches are suggested for the case that the FP-tree is too large to fit 
into main memory. 

The first approach writes the FP-tree to disk, then mines all frequent sets by reading 
the frequency information from the FP-tree. However, the size of the FP-tree could be
the same as the size of the database, and for each item in the FP-tree, we need at least one
FP-tree traversal. Thus the I/O's for writing and reading the disk-resident FP-tree could 
be prohibitive. 

The second approach projects the original database on each frequent item, then mines 
frequent itemsets from the small projected databases. One advantage of this approach 
is that any frequent itemset mined from a projected database is a frequent itemset in 
the original database. To get all frequent itemsets, we only need to take the union of the 
frequent itemsets from the small projected databases. This is in contrast to the partitioning 
approach, where all candidate frequent itemsets have to be stored and later verified by 
another pass over the database. The biggest problem of the projection approach is that the
total size of the projected databases could be too large, and there will be too many disk
I/O's for the projected databases.

Contributions

In this paper we consider the problem of mining frequent itemsets from very large databases. 
We adopt a divide-and-conquer approach. First we give three algorithms, the general 
divide-and-conquer algorithm, then an algorithm using naive projection, and an algorithm 
using aggressive projection. We also analyze the number of steps and disk I/O's required 
by these algorithms. 

In a detailed divide-and-conquer algorithm, called Diskmine, we use the highly efficient
FP-growth* method to mine frequent itemsets from an FP-tree in main memory. We
describe several novel techniques useful in mining frequent itemsets from disks, such as the
array technique, the item-grouping technique, and memory management techniques.

Finally, we present experimental results that demonstrate that our Diskmine
algorithm outperforms previous algorithms by orders of magnitude, and scales up to
terabytes of data.

Overview 

The remainder of this paper is organized as follows. In Section 2 we introduce approaches 
for mining frequent itemsets from disks. Three algorithms are introduced and analyzed. 
Section 3 gives a detailed divide-and-conquer algorithm Diskmine, in which many novel 
optimization techniques are used. These techniques are also described in Section 3. Ex- 
perimental results are given in Section 4. Section 5 concludes, and outlines directions for 
future research. 

2 Mining from disk 

How should one go about mining frequent itemsets from very large databases residing
in secondary memory, such as disks? Here "very large" means that the data
structures constructed from the database for mining frequent itemsets cannot fit in the
available main memory.

Basically, there are two strategies for mining frequent itemsets: the data structures
approach, and the divide-and-conquer approach.

The data structures approach consists of reading the database buffer by buffer, and
generating data structures (i.e. candidate sets or FP-trees). Since the data structures do not
fit into main memory, additional disk I/O's are required. The number of passes and disk I/O's
required by the approach depends on the algorithm and its data structures. For example, if
the algorithm is Apriori using a hash-tree for candidate itemsets [15], disk-based
hash-trees have to be used. Then the number of passes for the algorithm is the same as the
length of the longest frequent itemset, and the number of disk I/O's for the hash-trees
depends on the size of the hash-trees on disk.

The basic strategy for the divide-and-conquer approach is shown in Figure 1. In the
approach, $|\mathcal{D}|$ denotes the size of the data structures used by the mining algorithm, and
$M$ is the size of available main memory. Function mainmine is called if candidate frequent
itemsets (not necessarily all) can be mined without writing the data structures used by a
mining algorithm to disks. In Figure 1, a very large database is decomposed into a number
of smaller databases. If a "small" database is still too large, i.e., the data structures are
still too big to fit in main memory, the decomposition is recursively continued until the
data structures fit in main memory. After all small databases are processed, all candidate
frequent itemsets are combined in some way (obviously depending on the way the
decomposition was done) to get all frequent itemsets for the original database.

Procedure diskmine($\mathcal{D}$, $M$)
    if $|\mathcal{D}| < M$ then return mainmine($\mathcal{D}$)
    else decompose $\mathcal{D}$ into $\mathcal{D}_1, \ldots, \mathcal{D}_k$.
         return combine diskmine($\mathcal{D}_1$, $M$),
                        ...,
                        diskmine($\mathcal{D}_k$, $M$).

Figure 1: General divide-and-conquer algorithm for mining frequent itemsets from disk.

The efficiency of diskmine depends on the method used for mining frequent itemsets in 
main memory and on the number of disk I/O's needed in the decomposition and combi- 
nation phases. Sometimes the disk I/O is the main factor. Since the decomposition step 
involves I/O, ideally the number of recursive calls should be kept small. The faster we can 
obtain small decomposed databases, the fewer recursive calls we will need. On the other
hand, if a decomposition cuts down the size of the projected databases drastically, the 
trade-off might be that the combination step becomes more complicated and might involve 
heavy disk I/O. 

In the following we discuss two decomposition strategies, namely decomposition by 
partition, and decomposition by projection. 

Partitioning is an approach in which a large database is decomposed into cells of small 
non-overlapping databases. The cell-size is chosen so that all frequent itemsets in a cell 
can be mined without having to store any data structures in secondary memory. However, 
since a cell only contains partial frequency information of the original database, all frequent 
itemsets from the cell are local to that cell of the partition, and could only be candidate 
frequent itemsets for the whole database. Thus the candidate frequent itemsets mined 
from a cell have to be verified later to filter out false hits. Consequently, those candidate 
sets have to be written to disk in order to leave space for processing the next cell of the 
partition. After generating candidate frequent itemsets from all cells, another database 
scan is needed to filter out all infrequent itemsets. The partition approach therefore needs
only two passes over the database, but writing and reading candidate frequent itemsets will
involve a significant number of disk I/O's, depending on the size of the set of candidate
frequent itemsets.

We can conclude that the partition approach to decomposition keeps the recursive levels 
down to one, but the penalty is that the combination phase becomes expensive. 

To get an easier combination phase, we adopt another decomposition strategy, which we
call projection. Suppose for simplicity that there are four items, $a$, $b$, $c$, and $d$, and let
$\mathcal{D}$ be a database of transactions containing some or all of these items. We could then
decompose $\mathcal{D}$ into for instance $\mathcal{D}_{ab}$ and $\mathcal{D}_{cd}$. Typically, we would do this when the
descending order of frequency of the items is $a, b, c, d$. In $\mathcal{D}_{cd}$ we put all transactions
containing $c$ or $d$ (or both). In $\mathcal{D}_{ab}$ we put transactions containing $a$ or $b$ (or both), and
for each transaction we store only the $a,b$-part. Thus we will have shorter transactions in
$\mathcal{D}_{ab}$, and both $\mathcal{D}_{ab}$ and $\mathcal{D}_{cd}$ typically contain fewer transactions than $\mathcal{D}$. We can then
recursively mine all frequent itemsets from $\mathcal{D}_{ab}$ and $\mathcal{D}_{cd}$. Since this decomposition is not
a partition, the projected databases might not be that much smaller than the original
database. The upside, though, is that the set of all frequent itemsets in $\mathcal{D}$ now simply is
the union of the frequent itemsets in $\mathcal{D}_{ab}$ and $\mathcal{D}_{cd}$. This means that the combination phase
in diskmine is a simple union.

To illustrate this decomposition, let $\mathcal{D}$ contain the transactions $\{a, b, d\}$, $\{b, c, d\}$, $\{a, c\}$
and $\{a, b\}$. Suppose the minimum support is 50%. Then $\mathcal{D}_{cd} = \{\{a, b, d\}, \{b, c, d\}, \{a, c\}\}$ and
$\mathcal{D}_{ab} = \{\{a, b\}, \{b\}, \{a\}, \{a, b\}\}$. From $\mathcal{D}_{cd}$, we get all frequent itemsets $\{d\}$, $\{b, d\}$, and $\{c\}$.
Note that although $\{a\}$ and $\{b\}$ are also frequent in $\mathcal{D}_{cd}$, they are not listed since they
contain neither $c$ nor $d$. They will be listed in the frequent itemsets of $\mathcal{D}_{ab}$, which are
$\{a\}$, $\{b\}$, and $\{a, b\}$.
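The four-transaction example can be replayed mechanically. The following is a minimal sketch, not the paper's implementation: `project` and `frequent_with_master` are hypothetical helper names, frequent itemsets are found by brute force, and the 50% threshold is read as an absolute count of 2 taken with respect to the four original transactions.

```python
from itertools import combinations

def project(database, keep, masters):
    """Transactions containing at least one master item, restricted to `keep`."""
    return [t & keep for t in database if t & masters]

def frequent_with_master(projected, masters, min_count):
    """Brute-force frequent itemsets of `projected` that contain a master item."""
    items = sorted(set().union(*projected))
    found = set()
    for size in range(1, len(items) + 1):
        for cand in map(frozenset, combinations(items, size)):
            count = sum(1 for t in projected if cand <= t)
            if cand & masters and count >= min_count:
                found.add(cand)
    return found

db = [frozenset("abd"), frozenset("bcd"), frozenset("ac"), frozenset("ab")]
min_count = 2  # 50% of the four original transactions

# D_cd keeps whole transactions; D_ab keeps only the a,b-part.
d_cd = project(db, keep=frozenset("abcd"), masters=frozenset("cd"))
d_ab = project(db, keep=frozenset("ab"), masters=frozenset("ab"))

from_cd = frequent_with_master(d_cd, frozenset("cd"), min_count)
from_ab = frequent_with_master(d_ab, frozenset("ab"), min_count)
all_frequent = from_cd | from_ab  # the combination phase is a plain union
```

Note that the threshold stays fixed at the count derived from the original database; recomputing 50% relative to each projected database would change the result.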

To analyze the recurrence and required disk I/O's of the general divide-and-conquer 
algorithm when the decomposition strategy is projection, let us suppose that: 

- The original database size is D bytes. 

- The data structure is an FP-tree. 

- The FP-tree constructed from the original database $\mathcal{D}$ is $T$, and its size is $|T|$ bytes.

- If a conditional FP-tree $T'$ is constructed from an FP-tree $T$, then $|T'| < c \cdot |T|$, for some
constant $c < 1$.

- The main memory mining method is the FP-growth method. Two database scans are
needed for constructing an FP-tree from a database.

- The block size is $B$ bytes.

- The main memory available for the FP-tree is $M$ bytes.

In the first line of the algorithm in Figure 1, if $T$ cannot fit in memory, then projected
databases will be generated. We assumed that the size of the FP-tree for a projected
database is $c \cdot |T|$. If $c \cdot |T| < M$, function mainmine can be called for the projected database,
otherwise, the decomposition goes on. At pass $m$, the size of the FP-tree constructed from
a projected database is $c^m \cdot |T|$. Thus, the number of passes needed by the divide-and-conquer
projection algorithm is $1 + \lceil \log_c (M/|T|) \rceil$. Based on our experience and the analysis
by Han et al., we can say that for all practical purposes the number of passes will be at most
two. For example, let $D = 100$ Giga, $|T| = 10$ Giga, $M = 1$ Giga, and $c = 10\%$. Then the
number of passes is $1 + \lceil \log_{0.1} (2^{30}/(10 \times 2^{30})) \rceil = 2$. In five passes we can handle
databases up to 100 Terabytes. Namely, approximating $2^{10}$ by $10^3$, we get
$1 + \lceil \log_{0.1} (2^{30}/(10 \times 2^{40})) \rceil = 5$.

Assume that there are two passes, and that the sum of the sizes of all projected
databases is $D'$. There are two database scans for $\mathcal{D}$, one for finding all frequent single
items, one for decomposition. Two scans need $2 \cdot D/B$ disk I/O's. The projected
databases have to be written to the disks first, then later each scanned twice for building
the FP-tree. This step needs $3 \cdot D'/B$ disk I/O's. Thus, the total number of disk
I/O's for the general divide-and-conquer projection algorithm is at least

$$2 \cdot D/B + 3 \cdot D'/B. \qquad (1)$$

Obviously, the smaller D', the better the performance. 

One of the simplest projection strategies is to project the database on each frequent 
item, which we call naive projection. First we need some formal definitions. 

Definition 1 Let $I$ be a set of items. By $I^*$ we will denote strings over $I$, such that each
symbol occurs at most once in the string. If $\alpha$, $\beta$ are strings, and $i_j$ an item, then $\alpha.\beta$
denotes the concatenation of the string $\alpha$ with the string $\beta$.

For a string $\alpha$, we shall denote by $\{\alpha\}$ the set of items occurring in it.

Let $\mathcal{D}$ be an $I$-database. Then freqstring($\mathcal{D}$) is the string over $I$, such that each
frequent item in $\mathcal{D}$ occurs in it exactly once, and the items are in decreasing order of
frequency in $\mathcal{D}$. ■

As an example, consider the $\{a, b, c, d\}$-database $\mathcal{D} = \{\{a, b, c\}, \{a, b, c, d\}, \{a, c\}\}$. If
the minimum support is 60%, then freqstring($\mathcal{D}$) $= acb$. Note that $\{acb\} = \{a, c, b\}$.
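The example can be checked with a short sketch. The function below is a hypothetical illustration, not code from the paper; it assumes an inclusive threshold and breaks ties between equally frequent items alphabetically, a choice the definition leaves open.

```python
from collections import Counter

def freqstring(database, min_support):
    """Frequent items in decreasing order of frequency, as a string."""
    counts = Counter(item for t in database for item in t)
    threshold = min_support * len(database)
    frequent = [i for i, c in counts.items() if c >= threshold]
    # Assumption: ties (here a and c, both with count 3) are broken alphabetically.
    frequent.sort(key=lambda i: (-counts[i], i))
    return "".join(frequent)

db = [frozenset("abc"), frozenset("abcd"), frozenset("ac")]
result = freqstring(db, 0.6)
```

With the three transactions above, item d occurs once and falls below the 60% threshold, so it is dropped from the string.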

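Definition 2 Let $\mathcal{D}$ be an $I$-database, and let freqstring($\mathcal{D}$) $= i_1 i_2 \cdots i_k$. For
$j \in \{1, \ldots, k\}$ we define $\mathcal{D}_{i_j} = \{\tau \cap \{i_1, \ldots, i_j\} : i_j \in \tau, \tau \in \mathcal{D}\}$.

Let $\alpha \in I^*$. We define $\mathcal{D}_\alpha$ inductively: $\mathcal{D}_\epsilon = \mathcal{D}$, and let freqstring($\mathcal{D}_\alpha$) $= i_1 i_2 \cdots i_k$.
Then, for $j \in \{1, \ldots, k\}$, $\mathcal{D}_{\alpha.i_j} = \{\tau \cap \{i_1, \ldots, i_j\} : i_j \in \tau, \tau \in \mathcal{D}_\alpha\}$. ■

Obviously, $\mathcal{D}_{\alpha.i_j}$ is an $\{i_1, \ldots, i_j\}$-database. The decomposition of $\mathcal{D}_\alpha$ into
$\mathcal{D}_{\alpha.i_1}, \ldots, \mathcal{D}_{\alpha.i_k}$ is called the naive projection.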

Definition 3 Let $\alpha \in I^*$, $i_j \in I$, and let $\mathcal{D}_{\alpha.i_j}$ be an $I$-database. Then freqsets($\xi$, $\mathcal{D}_{\alpha.i_j}$)
denotes the subsets of $I$ that contain $i_j$ and are frequent in $\mathcal{D}_{\alpha.i_j}$ when the minimum
support is $\xi$. Usually, we shall abstract $\xi$ away, and write just freqsets($\mathcal{D}_{\alpha.i_j}$). ■

Lemma 1 Let $\mathcal{D}_\alpha$ be an $I$-database, and freqstring($\mathcal{D}_\alpha$) $= i_1 i_2 \cdots i_k$. Then

$$\mathrm{freqsets}(\mathcal{D}_\alpha) = \bigcup_{j \in \{1, \ldots, k\}} \mathrm{freqsets}(\mathcal{D}_{\alpha.i_j})$$

Proof. ($\subseteq$-direction). Let $S \in$ freqsets($\mathcal{D}_\alpha$), and suppose $i_n$ is the item in $S$ that is
least frequent in $\mathcal{D}_\alpha$. Since $\mathcal{D}_{\alpha.i_n}$ is an $\{i_1, \ldots, i_n\}$-database, and the transactions in $\mathcal{D}_\alpha$ that
contain an item $i_j$ are all in $\mathcal{D}_{\alpha.i_j}$, if $S$ is frequent in $\mathcal{D}_\alpha$, then $S$ must be frequent in $\mathcal{D}_{\alpha.i_n}$.
($\supseteq$-direction). For any frequent itemset $S \in$ freqsets($\mathcal{D}_{\alpha.i_j}$), according to the definition,
the support of any itemset in $\mathcal{D}_{\alpha.i_j}$ is not greater than the support of it in $\mathcal{D}_\alpha$. Therefore,
$S$ must be frequent in $\mathcal{D}_\alpha$. ■
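Lemma 1 can be sanity-checked on a toy database. The sketch below is only an illustration under stated assumptions: the helper names are hypothetical, frequent itemsets are enumerated by brute force, ties in the frequency order are broken alphabetically, and the minimum support is an absolute count so that it means the same thing in the original and in the projected databases.

```python
from collections import Counter
from itertools import combinations

def freqstring(database, min_count):
    counts = Counter(i for t in database for i in t)
    items = [i for i in counts if counts[i] >= min_count]
    return sorted(items, key=lambda i: (-counts[i], i))

def naive_projection(database, min_count):
    """Map each frequent item i_j to the database of Definition 2:
    transactions containing i_j, cut down to the items i_1..i_j."""
    order = freqstring(database, min_count)
    projected = {}
    for j, item in enumerate(order):
        prefix = frozenset(order[: j + 1])
        projected[item] = [t & prefix for t in database if item in t]
    return projected

def freqsets(database, min_count, must_contain=None):
    """Brute-force frequent itemsets, optionally only those with `must_contain`."""
    items = sorted(set().union(*database)) if database else []
    out = set()
    for size in range(1, len(items) + 1):
        for cand in map(frozenset, combinations(items, size)):
            if must_contain is not None and must_contain not in cand:
                continue
            if sum(1 for t in database if cand <= t) >= min_count:
                out.add(cand)
    return out

db = [frozenset("abc"), frozenset("abcd"), frozenset("ac"), frozenset("bd")]
min_count = 2
parts = naive_projection(db, min_count)
union = set().union(*(freqsets(p, min_count, item) for item, p in parts.items()))
lemma_holds = union == freqsets(db, min_count)
```

A single passing example is of course no proof; the check merely shows how the two sides of the lemma are computed.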

Figure 2 gives a divide-and-conquer algorithm that uses naive projection. A transaction
$\tau$ in $\mathcal{D}_\alpha$ will be partly inserted into $\mathcal{D}_{\alpha.i_j}$ if and only if $\tau$ contains $i_j$. The parallel projection
algorithm introduced in [S] is an algorithm of this kind. 

Procedure naivediskmine($\mathcal{D}_\alpha$, $M$)
    if $|\mathcal{D}_\alpha| < M$ then return mainmine($\mathcal{D}_\alpha$)
    else let freqstring($\mathcal{D}_\alpha$) $= i_1 i_2 \cdots i_n$
         return naivediskmine($\mathcal{D}_{\alpha.i_1}$, $M$) $\cup$
                ... $\cup$
                naivediskmine($\mathcal{D}_{\alpha.i_n}$, $M$).

Figure 2: A simple divide-and-conquer algorithm for mining frequent itemsets from disk

Let's analyze the disk I/O's of the algorithm in Figure 2. As before, we assume that
there are two passes, that the data structure is an FP-tree, and that the main memory
mining method is FP-growth. If in $\mathcal{D}_\epsilon$ each transaction contains on average $n$ frequent
items, each transaction will be written to $n$ projected databases. Thus the total length of
the associated transactions in the projected databases is $n + (n-1) + \cdots + 1 = n(n+1)/2$,
and the total size of all projected databases is $(n+1)/2 \cdot D \approx n/2 \cdot D$.

There are two database scans for $\mathcal{D}_\epsilon$, one for finding all frequent single items, and one
for decomposition. Two scans need $2 \cdot D/B$ disk I/O's. The projected databases have to
be written to the disks first, then later scanned twice each for building an FP-tree. This
step needs at least $3 \cdot n/2 \cdot D/B$ disk I/O's. Thus, the total disk I/O's for the
divide-and-conquer algorithm with naive projection is

$$2 \cdot D/B + n \cdot 3/2 \cdot D/B. \qquad (2)$$

The recurrence structure of algorithm naivediskmine is shown in Figure 3. The reader
should ignore the nodes in the shaded area at this point; they represent processing in main
memory.

In a typical application, $n$, the average number of frequent items per transaction, could be
in the hundreds or thousands. It therefore makes sense to devise a smarter projection
strategy. Before we go further, we introduce some definitions and a lemma.

Figure 3: Recurrence structure of Naive Projection

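Definition 4 Let $\mathcal{D}_\alpha$ be an $I$-database, and let freqstring($\mathcal{D}_\alpha$) $= \beta_1.\beta_2.\cdots.\beta_k$, where
each $\beta_j$ is a string in $I^*$. We call $\beta_1.\beta_2.\cdots.\beta_k$ a grouping of freqstring($\mathcal{D}_\alpha$). For
$j \in \{1, \ldots, k\}$, we now define $\mathcal{D}_{\alpha.\beta_j} = \{\tau \cap \{\beta_1, \ldots, \beta_j\} : \tau \in \mathcal{D}_\alpha, \tau \cap \{\beta_j\} \neq \emptyset\}$.

In $\mathcal{D}_{\alpha.\beta_j}$, items in $\{\beta_j\}$ are called master items, and items in $\{\beta_1, \ldots, \beta_{j-1}\}$ are called
slave items. ■

For example, if freqstring($\mathcal{D}_\alpha$) $= abcde$, then $\beta_1 = abc$, $\beta_2 = de$ gives the grouping $abc.de$
of $abcde$.

Definition 5 Let $\{\alpha, \beta\} \subseteq I^*$, and let $\mathcal{D}_{\alpha.\beta}$ be an $I$-database. Then freqsets($\mathcal{D}_{\alpha.\beta}$)
denotes the subsets of $I$ that contain at least one item in $\{\beta\}$ and are frequent in $\mathcal{D}_{\alpha.\beta}$. ■

Lemma 2 Let $\alpha \in I^*$, $\mathcal{D}_\alpha$ be an $I$-database, and freqstring($\mathcal{D}_\alpha$) $= \beta_1 \beta_2 \cdots \beta_k$. Then

$$\mathrm{freqsets}(\mathcal{D}_\alpha) = \bigcup_{j \in \{1, \ldots, k\}} \mathrm{freqsets}(\mathcal{D}_{\alpha.\beta_j})$$

Proof. Straightforward from Lemma 1 and the definition of $\mathcal{D}_{\alpha.\beta_j}$. ■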
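The grouped projection of Definition 4 can be sketched in a few lines. This is an illustration only: `grouped_projection` is a hypothetical name, the grouping is passed in rather than computed, and the four transactions are made up so that their frequency order is $abcde$.

```python
def grouped_projection(database, groups):
    """`groups` lists the strings beta_1..beta_k in freqstring order.
    Returns one projected database per group: transactions that contain a
    master item of the group, cut down to the items of groups 1..j."""
    result = []
    seen = frozenset()
    for group in groups:
        masters = frozenset(group)
        seen = seen | masters  # master items of this group plus all slave items
        result.append([t & seen for t in database if t & masters])
    return result

# Made-up transactions whose item frequencies are a:3, b:3, c:2, d:1, e:1.
db = [frozenset("abe"), frozenset("acd"), frozenset("ab"), frozenset("bc")]
d_abc, d_de = grouped_projection(db, ["abc", "de"])
```

With the grouping $abc.de$, every transaction is written at most twice, whereas naive projection would write a transaction once per frequent item it contains.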

Based on Lemma 2, we can obtain a more aggressive divide-and-conquer algorithm for
mining from disks. Figure 4 shows the algorithm aggressivediskmine. Here, freqstring($\mathcal{D}_\alpha$)
is decomposed into several substrings $\beta_j$, each of which could have more than one item.
Each substring corresponds to a projected database. A transaction $\tau$ in $\mathcal{D}_\alpha$ will be partly
inserted into $\mathcal{D}_{\alpha.\beta_j}$ if and only if $\tau$ contains at least one item $a$ such that $a \in \{\beta_j\}$. Since
there will be fewer projected databases, there will be fewer disk I/O's. Compared with the
algorithm in Figure 2, we can expect that a large amount of disk I/O will be saved by the
algorithm in Figure 4.

Procedure aggressivediskmine($\mathcal{D}_\alpha$, $M$)
    if $|\mathcal{D}_\alpha| < M$ then return mainmine($\mathcal{D}_\alpha$)
    else let freqstring($\mathcal{D}_\alpha$) $= \beta_1 \beta_2 \cdots \beta_k$
         return aggressivediskmine($\mathcal{D}_{\alpha.\beta_1}$, $M$) $\cup$
                ... $\cup$
                aggressivediskmine($\mathcal{D}_{\alpha.\beta_k}$, $M$).

Figure 4: A more aggressive divide-and-conquer algorithm for mining frequent itemsets from disk

Let's analyze the recurrence and disk I/O's of the aggressive divide-and-conquer
algorithm. The number of passes needed by the algorithm is still $1 + \lceil \log_c (M/|T|) \rceil \approx 2$, since
grouping items doesn't change the size of an FP-tree for a projected database. However,
for disk I/O, suppose that in $\mathcal{D}_\epsilon$ each transaction contains on average $n$ frequent items, and
that we can group them into $k$ groups of equal size. Then the $n$ items will be written to
the projected databases with total length $n/k + 2 \cdot n/k + \cdots + k \cdot n/k = (k+1)/2 \cdot n$.
The total size of all projected databases is $(k+1)/2 \cdot D \approx k/2 \cdot D$. The total disk I/O's for the
aggressive divide-and-conquer algorithm is then

$$2 \cdot D/B + k \cdot 3/2 \cdot D/B. \qquad (3)$$

The recurrence structure of algorithm aggressivediskmine is shown in Figure 5. Compared
to Figure 3, we can see that the part of the tree that corresponds to decomposition
(the nonshaded part) is much smaller in Figure 5. Although the example is very small, it
exhibits the general structure of the two trees.

If $k \ll n$, we can expect that the aggressive divide-and-conquer algorithm will
significantly outperform the naive one.
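To get a feeling for the gap between estimates (2) and (3), they can be evaluated for concrete numbers. The values of $D$, $B$, $n$ and $k$ below are hypothetical and chosen only for illustration; the two functions simply restate the formulas.

```python
def io_naive(D, B, n):
    """Estimate (2): two scans of the database plus write-once, read-twice
    of projected databases of total size n/2 * D."""
    return 2 * D / B + 3 * (n / 2) * D / B

def io_aggressive(D, B, k):
    """Estimate (3): same as (2) with k groups in place of n items."""
    return 2 * D / B + 3 * (k / 2) * D / B

D = 100 * 10**9   # hypothetical database size in bytes
B = 4096          # hypothetical block size in bytes
n, k = 100, 5     # hypothetical: 100 frequent items per transaction, 5 groups

ratio = io_naive(D, B, n) / io_aggressive(D, B, k)
```

For these numbers the ratio is $(2 + 150)/(2 + 7.5) = 16$, independent of $D$ and $B$, which only scale both estimates by the same factor.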

3 Algorithm Diskmine 

In this section we give the details of our divide-and-conquer algorithm for mining frequent
itemsets from secondary memory. We call the algorithm Diskmine. In the algorithm, the
FP-tree is used as the data structure, and an extension of the FP-growth method,
FP-growth* [7], as the method for mining frequent itemsets from an FP-tree. Before
introducing the algorithm, let's first recall the FP-tree and the FP-growth* method.

3.1 The FP-tree and FP-growth* method 

The FP-tree (Frequent Pattern tree) is a data structure used in the FP-growth method by
Han et al. [8]. It is a compact representation of all relevant frequency information in a
database. Each node of the FP-tree stores an item name, an item count, and a link. Every
branch of the FP-tree represents a frequent itemset, and the nodes along the branches 
are stored in decreasing order of the frequency of the corresponding items, with leaves 
representing the least frequent items. Compression is achieved by building the tree in such 
a way that overlapping itemsets share prefixes of the corresponding branches. 

The FP-tree has a header table associated with it. Single items and their counts are 
stored in the header table in decreasing order of their frequency. The entry for an item 
also contains the head of a list that links all the nodes of the item in the FP-tree. 

The FP-growth method needs two database scans when mining all frequent itemsets. 
The first scan counts the number of occurrences of each item. The second scan constructs 
the initial FP-tree, which contains all frequency information of the original dataset. Mining 
the database then becomes mining the FP-tree. 
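The two-scan construction can be sketched as follows. This is a minimal illustration of the description above, not the implementation used in the paper: the class and function names are hypothetical, the minimum support is an absolute count, ties in the frequency order are broken alphabetically, and the header table is a plain dictionary from each item to the head of its node list.

```python
from collections import Counter

class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}  # item -> child Node
        self.link = None    # next node in the tree carrying the same item

def build_fp_tree(database, min_count):
    # First scan: count single items and fix the frequency order.
    counts = Counter(i for t in database for i in t)
    order = sorted((i for i in counts if counts[i] >= min_count),
                   key=lambda i: (-counts[i], i))
    rank = {item: r for r, item in enumerate(order)}
    root = Node(None, None)
    header = {item: None for item in order}  # item -> head of its node list
    # Second scan: insert the frequent items of each transaction in that order,
    # so that transactions with a common prefix share a branch.
    for t in database:
        node = root
        for item in sorted((i for i in t if i in rank), key=rank.get):
            child = node.children.get(item)
            if child is None:
                child = Node(item, node)
                node.children[item] = child
                child.link, header[item] = header[item], child
            child.count += 1
            node = child
    return root, header, counts

def item_count_from_tree(header, item):
    """Sum the counts along the node list of `item`."""
    total, node = 0, header[item]
    while node is not None:
        total += node.count
        node = node.link
    return total

db = [frozenset("abd"), frozenset("bcd"), frozenset("ac"), frozenset("ab")]
root, header, counts = build_fp_tree(db, min_count=2)
```

Following the node list of an item and summing the counts recovers the item's support, which is one way to see that the tree retains the frequency information of the database.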

The FP-growth method relies on the following principle: if X and Y are two itemsets, 
the count of itemset X U Y in the database is exactly that of Y in the restriction of the 
database to those transactions containing X. This restriction of the database is called the 
conditional pattern base of X, and the FP-tree constructed from the conditional pattern 
base is called $X$'s conditional FP-tree, which we denote by $T_X$. We can view the FP-tree
constructed from the initial database as $T_\emptyset$, the conditional FP-tree for $\emptyset$. Note that for
any itemset $Y$ that is frequent in the conditional pattern base of $X$, the set $X \cup Y$ is a
frequent itemset for the original database.$^1$
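The principle behind FP-growth is easy to verify on a toy database. In the sketch below the conditional pattern base is materialised as a plain list of transactions rather than as an FP-tree; this, like the function names, is an assumption made for illustration.

```python
def count(itemset, database):
    """Number of transactions containing every item of `itemset`."""
    itemset = frozenset(itemset)
    return sum(1 for t in database if itemset <= t)

def conditional_pattern_base(database, x):
    """Restriction of the database to the transactions containing X."""
    x = frozenset(x)
    return [t for t in database if x <= t]

db = [frozenset("abd"), frozenset("bcd"), frozenset("ac"), frozenset("ab")]
x, y = frozenset("b"), frozenset("d")
base = conditional_pattern_base(db, x)

count_in_db = count(x | y, db)   # count of X ∪ Y in the database
count_in_base = count(y, base)   # count of Y in X's conditional pattern base
```

Both counts are 2 here: the transactions containing $\{b, d\}$ are exactly those transactions of the base that contain $d$.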

The recursive structure of FP-growth can be seen from the shaded area in Figure 3. In
the figure, we will enter the main memory phase for instance for the conditional database
$\mathcal{D}_a$. Then FP-growth first constructs the FP-tree $T_a$ from $\mathcal{D}_a$. The tree rooted at $T_a$ shows
the recursive structure of FP-growth, assuming for simplicity that the relative frequency
remains the same in all conditional pattern bases.

In [7], we extend the FP-growth method into the FP-growth* method by using an array
technique and other optimizations. The experimental results in the paper and those done
by the FIMI-organizers show that the FP-growth* method outperforms the FP-growth
method, especially when the database is big or sparse.

$^1$ In keeping with the notation introduced so far, we shall in the sequel write $T_\alpha$ when we mean the
FP-tree $T_{\{\alpha\}}$. Similarly we shall write $T_{\alpha.i}$ instead of $T_{\{\alpha\} \cup \{i\}}$.

The array technique

In the original FP-growth method, to construct an FP-tree from a database $\mathcal{D}$, two
database scans are required. The first scan gets all frequent items, the second constructs
the FP-tree. And later, for each item $i$ in the header of a conditional FP-tree $T_\alpha$, two
traversals of $T_\alpha$ are needed for constructing the new conditional FP-tree $T_{\alpha.i}$. The first
traversal finds all frequent items in the conditional pattern base of $\alpha.i$, and initializes the
FP-tree $T_{\alpha.i}$ by constructing its header table. The second traversal constructs the new tree
$T_{\alpha.i}$.

In the boosted FP-growth* method [7], a simple data structure, an array, is introduced
to omit the first scan of $T_\alpha$. This is achieved by constructing an array $A_\alpha$ while building
$T_\alpha$. More precisely, in the second scan of the original database we construct $T_\epsilon$, and an
array $A_\epsilon$. The array will store the counts of all 2-itemsets: each cell $[j, k]$ in the array is a
counter of the 2-itemset $\{i_j, i_k\}$. All cells in the array are initialized to 0. When an itemset
is inserted into $T_\epsilon$, the associated cells in $A_\epsilon$ are updated. After the second scan, the array
$A_\epsilon$ contains the counts of all pairs of items frequent in $\mathcal{D}_\epsilon$.

Next, the FP-growth* method is recursively called to mine frequent itemsets for each
item in the header table of $T_\epsilon$. However, now for each item $i$, instead of traversing $T_\epsilon$ along
the linked list starting at $i$ to get all frequent items in $i$'s conditional pattern base, $A_\epsilon[i, *]$
gives all frequent items for $i$. Therefore, for each item $i$ in $T_\epsilon$ the array $A_\epsilon$ makes the first
traversal of $T_\epsilon$ unnecessary, and $T_{\epsilon.i}$ can be initialized directly from $A_\epsilon$.

For the same reason, when we construct from a conditional FP-tree $T_\alpha$ a new conditional
FP-tree for $\alpha.i$, for an item $i$, a new array $A_{\alpha.i}$ is calculated. During the construction
of the new FP-tree $T_{\alpha.i}$, the array $A_{\alpha.i}$ is filled. The construction of arrays and FP-trees
continues until the FP-growth* method terminates.

Note that if, for a database, we have the array that stores the counts of all pairs of
frequent items, then only one database scan is needed to construct an FP-tree from the
database.
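The bookkeeping of the array technique can be sketched separately from the tree. The code below is a simplified illustration, not the FP-growth* implementation: it fills a lower-triangular pair-count array during the second scan and omits the FP-tree insertion that would happen in the same loop. All names are hypothetical, ties in the frequency order are broken alphabetically, and the minimum support is an absolute count.

```python
from collections import Counter

def scan_with_pair_array(database, min_count):
    counts = Counter(i for t in database for i in t)  # first scan
    order = sorted((i for i in counts if counts[i] >= min_count),
                   key=lambda i: (-counts[i], i))
    index = {item: j for j, item in enumerate(order)}
    # Lower-triangular array: cell [j][k] with k < j counts the pair {i_j, i_k}.
    array = [[0] * j for j in range(len(order))]
    for t in database:                                # second scan
        idx = sorted(index[i] for i in t if i in index)
        for pos, j in enumerate(idx):
            for k in idx[:pos]:
                array[j][k] += 1
        # The real method would also insert the transaction into the FP-tree here.
    return order, array

def frequent_partners(order, array, item, min_count):
    """Items before `item` in the order that are frequent together with it,
    read from the array row instead of traversing the tree."""
    j = order.index(item)
    return [order[k] for k in range(j) if array[j][k] >= min_count]

db = [frozenset("abd"), frozenset("bcd"), frozenset("ac"), frozenset("ab")]
order, array = scan_with_pair_array(db, min_count=2)
partners_of_d = frequent_partners(order, array, "d", min_count=2)
```

Only the row of an item is needed to initialise the header table of its conditional FP-tree, which is why the first traversal can be skipped.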

3.2 Divide-and-conquer by aggressive projection 

The algorithm Diskmine is shown in Figure 6. In the algorithm, $\mathcal{D}_\alpha$ is the original database
or a projected database, and $M$ is the maximal size of main memory that can be used by
Diskmine.

Diskmine uses the FP-tree as data structure and FP-growth* [7] as main memory mining
algorithm. Since the FP-tree encodes all frequency information of the database, we can
shift into main memory mining as soon as the FP-tree fits into main memory.

Since an FP-tree usually is a significant compression of the database, our Diskmine
algorithm begins optimistically, by calling trialmainmine, which starts scanning the database
and constructing the FP-tree. If the tree can be successfully completed and stored in main
memory, we have reached the bottom level of the recursion, and can obtain the frequent
itemsets of the database by running FP-growth* on the FP-tree in main memory.

Procedure Diskmine($\mathcal{D}_\alpha$, $M$)
    scan $\mathcal{D}_\alpha$ and compute freqstring($\mathcal{D}_\alpha$)
    call trialmainmine($\mathcal{D}_\alpha$, $M$)
    if trialmainmine($\mathcal{D}_\alpha$, $M$) aborted then
        compute a grouping $\beta_1 \beta_2 \cdots \beta_k$ of freqstring($\mathcal{D}_\alpha$).
        Decompose $\mathcal{D}_\alpha$ into $\mathcal{D}_{\alpha.\beta_1}, \ldots, \mathcal{D}_{\alpha.\beta_k}$
        for $j = 1$ to $k$ do begin
            if $\{\beta_j\}$ is a singleton then
                Diskmine($\mathcal{D}_{\alpha.\beta_j}$, $M$)
            else
                mainmine($\mathcal{D}_{\alpha.\beta_j}$)
        end
    else return freqsets($\mathcal{D}_\alpha$)

Figure 6: Algorithm Diskmine

Procedure trialmainmine($\mathcal{D}_\alpha$, $M$)
    start scanning $\mathcal{D}_\alpha$ and building the FP-tree $T_\alpha$ in main memory.
    if $|T_\alpha|$ exceeds $M$ then
        return the incomplete $T_\alpha$
    else
        call FP-growth*($T_\alpha$) and return freqsets($\mathcal{D}_\alpha$).

Figure 7: Trial main memory mining algorithm

If, at any time during trialmainmine, we run out of main memory, we abort and
return the partially constructed FP-tree, and a pointer to where we stopped scanning
the database. We then resume processing Diskmine($\mathcal{D}_\alpha$, $M$) by computing a grouping
$\beta_1, \ldots, \beta_k$ of freqstring($\mathcal{D}_\alpha$), and then decomposing $\mathcal{D}_\alpha$ into $\mathcal{D}_{\alpha.\beta_1}, \ldots, \mathcal{D}_{\alpha.\beta_k}$. We
recursively process each decomposed database $\mathcal{D}_{\alpha.\beta_j}$. During the first level of the recursion,
some groups $\beta_j$ will consist of a single item only. If $\{\beta_j\}$ is a singleton, we call Diskmine,
otherwise we call mainmine directly, since we put several items in a group only when we
estimate that the corresponding FP-tree will fit into main memory.

In computing the grouping $\beta_1, \ldots, \beta_k$ we assume that transactions in a very large
database are evenly distributed, i.e., if an FP-tree is constructed from part of a database,
then this FP-tree is representative of the FP-tree for the whole database. In other words, if
the size of the FP-tree is $n$ for $p\%$ of the database, then the size of the FP-tree for the whole
database is $n/p \cdot 100$. Most of the time, this gives an overestimation, since an FP-tree
increases fast only at the beginning stage, when items are encountered for the first time
and inserted into the tree. In the later stages, the changes to the FP-tree will be mostly
counter updates.
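The extrapolation is a one-line computation; the sketch below only restates it, with a hypothetical function name and made-up numbers.

```python
def estimate_full_tree_size(partial_size, percent_scanned):
    """Linear extrapolation n / p * 100 of the FP-tree size, assuming
    transactions are evenly distributed over the database."""
    if not 0 < percent_scanned <= 100:
        raise ValueError("percent_scanned must be in (0, 100]")
    return partial_size / percent_scanned * 100

# Hypothetical: the tree occupies 200 MB when trialmainmine aborts
# after scanning 25% of the database.
estimate_mb = estimate_full_tree_size(200, 25)
```

Since tree growth slows down once most items and prefixes have been seen, the value is best read as an upper estimate rather than a prediction.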

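Procedure mainmine($\mathcal{D}_{\alpha.\beta}$)
    build a modified FP-tree $T_{\alpha.\beta}$ for $\mathcal{D}_{\alpha.\beta}$
    for each $i \in \{\beta\}$ do begin
        construct the FP-tree $T_{\alpha.i}$ for $\mathcal{D}_{\alpha.i}$ from $T_{\alpha.\beta}$
        call FP-growth*($T_{\alpha.i}$) and return freqsets($\mathcal{D}_{\alpha.i}$).
    end

Figure 8: Main memory mining algorithm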

Since we know that there is only one master item in the database (for $\mathcal{D}_\epsilon$, no master
item at all), an FP-tree is constructed without the master item. In Figure 8, since $\mathcal{D}_{\alpha.\beta}$ is
for multiple master items, the FP-tree constructed from $\mathcal{D}_{\alpha.\beta}$ has to contain those master
items. However, the item order is a problem for the FP-tree, because we only want to
mine all frequent itemsets that contain master items. To solve this problem, we simply use
the item order in the partial FP-tree returned by the aborted trialmainmine($\mathcal{D}_\alpha$). This is
what we mean by a "modified FP-tree" on the first line of the algorithm in Figure 8.

The entire recurrence structure of Diskmine can be seen in Figure El Compared to the 
naive projection in Figure 3, we see that since the aggressive projection uses main memory
more effectively, the decomposition phase is shorter, resulting in less I/O.

Theorem 1 Diskmine($\mathcal{D}$) returns freqsets($\mathcal{D}$).

Proof. The correctness of Diskmine can be derived from the correctness of the FP-growth*
method in [7] and Lemma 2 in Section 2. In Diskmine, each item acts as master item in
exactly one projected database. If a projected database is only for one master item $i_j$, the
result of the FP-growth* method or of a recursive call of Diskmine will be freqsets($\mathcal{D}_{i_j}$). If a
projected database is for a set $\{\beta\}$ of master items, it contains all frequency information
associated with the master items. Since the order of the items in an FP-tree doesn't
influence the correctness of the FP-growth* method, mainmine
indeed returns only frequent itemsets that contain master item(s), i.e. mainmine gives the
exact value of freqsets($\mathcal{D}_{\alpha.\beta}$). According to the same lemma, algorithm Diskmine then correctly
outputs all frequent itemsets in the original database. ■

3.3 Memory Management 

Given a database $\mathcal{D}_\alpha$, to successfully apply the FP-growth* method, the basic main memory
requirement is that the size of the FP-tree $T_\alpha$ constructed from $\mathcal{D}_\alpha$ is less than the available
amount $M$ of main memory. In addition, we need space for the descendant conditional
FP-trees that will be constructed during the recursive calls of FP-growth*.



Suppose the main memory requirement for $T_\alpha$ plus its descendant FP-trees is $m$. If
$M < m$, but the difference $m - M$ is not very big, the FP-growth* method could still be
run because the operating system uses virtual memory. However, there could be too much
page swapping, which takes too much time and makes FP-growth* very slow. Therefore,
given $M$, for a very large database $\mathcal{D}_\alpha$, we have to stop the construction of the FP-tree $T_\alpha$
and the execution of the FP-growth* method before all physical main memory is used up.

Another problem is that we will construct a large number of FP-trees. Since there can 
be millions of nodes in those FP-trees, inserting and deleting nodes is time consuming. 

In the implementation of the algorithm, we use our own main memory management 
for allocating and deallocating nodes, and calculating the main memory we have already 
used. We assume that the main memory needed by an FP-tree is proportional to the 
number of nodes in the FP-trees. We also assume that the workspace needed for calling 
FP-growth* (T) method on an FP-tree is roughly 10% of the size of the FP-tree T. Here, 
10% is a liberal assumption according to the experimental result in [H]. Later in this 
section, a more accurate value will be given. If the size of the FP-tree is more than $0.9 \cdot M$, we
conclude that $M$ is not big enough to store the whole FP-tree $T_\alpha$.

Since all memory for nodes in an FP-tree is deallocated after a call of FP-growth* ends, 
a chunk of memory is allocated for each FP-tree when we create the tree, and the chunk 
size is changeable. After generating all frequent itemsets from the FP-tree, the chunk is 
discarded, and all nodes in the tree are deleted. Thus we successfully avoid freeing nodes 
in FP-trees one by one, which would take too much time. 
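
The chunk-based node management and the $0.9 \cdot M$ test can be sketched as follows. This is only an illustration under stated assumptions: the class `NodeChunk`, its fields, and the fixed `bytes_per_node` cost are hypothetical, and Python lists stand in for a raw memory chunk.

```python
class NodeChunk:
    """One growable chunk holding every node of a single FP-tree (hypothetical)."""

    def __init__(self, bytes_per_node, memory_budget):
        self.bytes_per_node = bytes_per_node    # assumed fixed cost per node
        self.memory_budget = memory_budget      # M, in bytes
        self.items, self.counts, self.parents = [], [], []

    def bytes_used(self):
        return len(self.items) * self.bytes_per_node

    def new_node(self, item, parent):
        """Return the index of a new node, or None if the tree would exceed 0.9 * M."""
        needed = self.bytes_used() + self.bytes_per_node
        if 10 * needed > 9 * self.memory_budget:   # integer form of needed > 0.9 * M
            return None
        self.items.append(item)
        self.counts.append(1)
        self.parents.append(parent)
        return len(self.items) - 1

    def discard(self):
        """Release all nodes at once instead of freeing them one by one."""
        self.items, self.counts, self.parents = [], [], []


chunk = NodeChunk(bytes_per_node=10, memory_budget=100)
allocated = [chunk.new_node(item="x", parent=None) for _ in range(10)]
print(allocated)            # the tenth request is refused
chunk.discard()
print(chunk.bytes_used())
```

Referring to nodes by index rather than by individual objects is what makes discarding the whole chunk a constant number of operations in this sketch.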

3.4 Applying the Array Technique 

In Diskmine, the array technique is also applied to save FP-tree traversals. Furthermore,
when projected databases are generated, the array technique can save a great number of
disk I/O's.

Recall that in trialmainmine, if an FP-tree cannot be accommodated in main memory,
the construction stops. Suppose we also stopped scanning the database at that point. Then later,
after generating all projected databases, for a projected database with only one master item,
two database scans are required to construct an FP-tree for the master item. The first
scan gets all frequent items for the master item, the second scan constructs the FP-tree.
For a projected database with several master items, though the FP-tree constructed from
the database uses the modified item order (the order from the header of the FP-tree in
the previous level of the recursion), to construct new FP-trees for the master items, two
FP-tree traversals are needed. To avoid the extra scan, in Diskmine we calculate an array
for each FP-tree. When constructing the FP-tree from $\mathcal{D}_\alpha$, if it is found that the tree
cannot fit in main memory, the construction of the FP-tree $T_\alpha$ stops, but the scan of the
database $\mathcal{D}_\alpha$ continues so that we finish filling the cells of the array $A_\alpha$. Here, some extra
disk I/O's are spent, but the payback is that we save one database scan for each
projected database. Furthermore, finishing the scanning of $\mathcal{D}_\alpha$ doesn't require any more
main memory, since the array $A_\alpha$ is already there.

From the array, for each projected database, the count of each pair of master items
and the count of each pair of master item and slave item can be known. As an example,
suppose a projected database is only for one master item $i_j$ and slave items
$i_1, \ldots, i_{j-1}$. To mine all frequent itemsets, from the line for $i_j$ in the array, accurate counts
for $[i_j, i_{j-1}], [i_j, i_{j-2}], \ldots, [i_j, i_1]$ can be easily found. If there were no array we would need
an extra database scan.
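
A minimal sketch of such a pair-count array is given below. The function name `build_pair_counts` and the lower-triangular list-of-lists layout are assumptions made for illustration; the paper does not prescribe a storage layout here.

```python
from itertools import combinations


def build_pair_counts(transactions, freq_order):
    """Return a lower-triangular array A, where A[j][k] (k < j) counts the
    transactions containing both freq_order[j] and freq_order[k]."""
    rank = {item: position for position, item in enumerate(freq_order)}
    array = [[0] * j for j in range(len(freq_order))]
    for transaction in transactions:
        # Keep only frequent items, as ranks in ascending order, without duplicates.
        ranks = sorted({rank[item] for item in transaction if item in rank})
        for k, j in combinations(ranks, 2):    # k < j because ranks is sorted
            array[j][k] += 1
    return array


freq_order = ["a", "b", "c"]                   # i_1, i_2, i_3 (hypothetical items)
transactions = [{"a", "b", "c"}, {"a", "c"}, {"b", "c"}, {"a", "b"}, {"c", "z"}]
A = build_pair_counts(transactions, freq_order)
print(A[2])    # the line for "c": counts of [c, a] and [c, b]
```

Reading the line `A[2]` yields the counts for the master item "c" with each of its slave items without another pass over the transactions.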

With the array, we can also make a projected database drastically smaller. In the
definition of $\mathcal{D}_{\alpha.\beta_j}$, we see that $\mathcal{D}_{\alpha.\beta_j}$ is a $\{\beta_1, \ldots, \beta_j\}$-database. Actually, by checking
the array $A_\alpha$, if a slave item is found not to co-occur frequently with any master item
in $\beta_j$, it's useless to include the slave item in $\mathcal{D}_{\alpha.\beta_j}$, because no frequent itemsets mined
from $\mathcal{D}_{\alpha.\beta_j}$ will contain that slave item. For the same reason, if we also find that a master
item $a$ is not frequent with any other master item or slave item, it will not be written
to $\mathcal{D}_{\alpha.\beta_j}$ either. However, the frequent itemset $\alpha.a$ is output. Furthermore, if from
the array we see that a master item $a$ is only frequent with one item (master or slave)
$b$, frequent itemsets $\alpha.a$ and $\alpha.a.b$ are output directly, and item $a$ will not appear in
$\mathcal{D}_{\alpha.\beta_j}$. Therefore, by looking through the array, we find all slave items such that they are
not frequent with any master item in $\beta_j$, and all master items such that their number
of frequent items in $\{\beta_1, \ldots, \beta_j\}$ is 0 or 1. When generating $\mathcal{D}_{\alpha.\beta_j}$, all those items are
removed from the transactions we put in $\mathcal{D}_{\alpha.\beta_j}$.
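
The pruning rules above can be sketched as a small function. The name `prune_with_array`, the dictionary-of-pairs representation of the array, and the omission of the prefix $\alpha$ from the output itemsets are assumptions for illustration only. The sketch also assumes each master item is itself frequent.

```python
def prune_with_array(pair_count, masters, slaves, min_count):
    """Decide which items are written to a projected database.

    pair_count maps frozenset({x, y}) to the co-occurrence count of x and y.
    Returns (kept_masters, kept_slaves, itemsets_output_directly).
    """
    def frequent(x, y):
        return pair_count.get(frozenset((x, y)), 0) > min_count

    # A slave item is useful only if it is frequent with at least one master item.
    kept_slaves = [s for s in slaves if any(frequent(s, m) for m in masters)]

    kept_masters, direct_output = [], []
    for a in masters:
        partners = [b for b in masters + slaves if b != a and frequent(a, b)]
        if len(partners) <= 1:
            direct_output.append((a,))                 # itemset {a}
            if partners:
                direct_output.append((a, partners[0])) # itemset {a, b}
        else:
            kept_masters.append(a)
    return kept_masters, kept_slaves, direct_output


pair_count = {
    frozenset(("c", "a")): 3,
    frozenset(("c", "b")): 2,
    frozenset(("d", "a")): 2,
    frozenset(("c", "d")): 1,
}
print(prune_with_array(pair_count, masters=["c", "d"], slaves=["a", "b"], min_count=1))
```

Here master item "d" is frequent only with "a", so its two itemsets are emitted immediately and "d" is left out of the projected database.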

3.5 Statistics 



Symbol                   Meaning

$t(\mathcal{D}_\alpha)$          Number of transactions in $\mathcal{D}_\alpha$
$A_\alpha[j,k]$          Count of frequent item pair $\{i_j, i_k\}$ in $\mathcal{D}_\alpha$
$t(T_\alpha)$            Number of transactions used for constructing $T_\alpha$
$\nu(T_\alpha)$          Number of nodes in $T_\alpha$
$\nu[j](T_\alpha)$       Number of nodes in $T_\alpha$ if we retain only nodes for items $i_1, \ldots, i_j$
$\mu[j](T_\alpha)$       Number of nodes in $T_\alpha$, where a node $P$ for item $i_k$ is counted if
                         it satisfies the following conditions: 1) $P$ is in a branch that contains $i_j$,
                         2) $i_k \in \{i_1, \ldots, i_{j-1}\}$, 3) $A_\alpha[j,k] > \xi$



Table 1: Statistics Information 



Algorithm Diskmine collects some statistics on the partial FP-tree $T_\alpha$ and the rest of the
database $\mathcal{D}_\alpha$, for the purpose of grouping items together. Table 1 shows the statistics
information. In the table, $\mathcal{D}_\alpha$ is the original database or the current projected database,
and freqstring($\mathcal{D}_\alpha$) $= i_1 \ldots i_j \ldots i_n$. The partial FP-tree is $T_\alpha$ and $\xi$ is the absolute
value of the minimum support.

In the table, the array discussed in Section 3.4 is also listed as statistics. Values for the
cells of the array are accumulated during the construction of the partial $T_\alpha$. If trialmainmine
is aborted, the rest of the statistics is collected by scanning the remaining part of $\mathcal{D}_\alpha$.
Values in $\nu[j](T_\alpha)$ can also be obtained during the construction of $T_\alpha$. Here $\nu[j](T_\alpha)$ records
the size of the FP-tree after $T_\alpha$ is trimmed and only contains items $i_1, \ldots, i_j$. Notice that
$\nu(T_\alpha)$ is equal to $\nu[n](T_\alpha)$. This is also the size of a tree that can fit in main memory. The
value for $\mu[j](T_\alpha)$ can be obtained by traversing $T_\alpha$ once; it gives the size of the FP-tree
$T_{\alpha.i_j}$.

It might seem that collecting all these statistics is a large overhead. However, since all
work is done in main memory, it doesn't take much time, and the time saved on disk
I/O's is far more than the time spent on gathering statistics.

3.6 Grouping items 

In the Diskmine algorithm, the fourth line computes a grouping $\beta_1 \beta_2 \cdots \beta_k$ of
freqstring($\mathcal{D}_\alpha$). Each string $\beta$ corresponds to a group and each $\beta$ consists of at least
one item. For each $\beta$, a new projected database $\mathcal{D}_{\alpha.\beta}$ will be computed from $\mathcal{D}_\alpha$, then
written to disk and read from disk later. Therefore, the more groups, the more disk I/O's.
In other words, there should be as many items in each $\beta$ as possible. To group items, two
questions have to be answered.

1. If $\beta$ currently only has one item $i_j$, after projection, is the main memory big enough
for accommodating $T_{\alpha.i_j}$ constructed from $\mathcal{D}_{\alpha.i_j}$ and running the FP-growth* method
on $T_{\alpha.i_j}$?

2. If more items are put in $\beta$, after projection, is the main memory big enough for
accommodating $T_{\alpha.\beta}$ constructed from $\mathcal{D}_{\alpha.\beta}$ and running FP-growth* on $T_{\alpha.\beta}$ only
for items in $\beta$?

Answering the first question is pretty easy, since for each item $i_j$, the number $\mu[j](T_\alpha)$
gives the size of an FP-tree if the tree is constructed from the partial FP-tree $T_\alpha$. Therefore
$\mu[j](T_\alpha)$ can be used to estimate the size of the FP-tree $T_{\alpha.i_j}$. By the assumption that the
transactions in $\mathcal{D}_\alpha$ are evenly distributed and that the partial $T_\alpha$ represents the whole
FP-tree for $\mathcal{D}_\alpha$, the estimated size of the FP-tree $T_{\alpha.i_j}$ is $\mu[j](T_\alpha) \cdot t(\mathcal{D}_\alpha)/t(T_\alpha)$.
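
As a worked instance of this estimate, with purely hypothetical numbers that are not taken from the experiments: suppose the partial tree was built from $t(T_\alpha) = 250{,}000$ of $t(\mathcal{D}_\alpha) = 1{,}000{,}000$ transactions, and $\mu[j](T_\alpha) = 2{,}000$ nodes. Then

```latex
\mu[j](T_\alpha) \cdot \frac{t(\mathcal{D}_\alpha)}{t(T_\alpha)}
  = 2000 \cdot \frac{1\,000\,000}{250\,000}
  = 2000 \cdot 4
  = 8000 \text{ nodes (estimated size of } T_{\alpha.i_j}\text{)}
```

If this estimate, plus the workspace for FP-growth*, is within the memory budget, the single-item group can be mined in main memory.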

Before answering the second question, we introduce the cut point from which the first 
group can be easily found. 

Finding the cut point. Recall the order that FP-growth* uses in mining frequent itemsets.
Starting from the least frequent item $i_n$, all frequent itemsets that contain $i_n$ are
mined first. Then the process is repeated for $i_{n-1}$, and so on. Notice that when mining
frequent itemsets for $i_k$, all frequency information about $i_{k+1}, \ldots, i_n$ is useless. Thus,
though a complete FP-tree $T_\alpha$ constructed from $\mathcal{D}_\alpha$ could not fit in main memory, we can
find many $k$'s such that the trimmed FP-tree containing only nodes for items $i_k, \ldots, i_1$
will fit into main memory. All frequent itemsets for $i_k, \ldots, i_1$ can then be mined from one
trimmed tree. We call the biggest of such $k$'s the cut point. At this point, main memory
is big enough for storing the FP-tree containing only $i_k, \ldots, i_1$, and there is also enough
main memory for running FP-growth* on the tree. Obviously, if the cut point $k$ can be
found, items $i_k, \ldots, i_1$ can be grouped together. Only one projected database is needed for
$i_k, \ldots, i_1$.

There are two ways to estimate the cut point. One way is to get the cut point from
the values of $t(\mathcal{D}_\alpha)$ and $t(T_\alpha)$ in Table 1. Figure 9 illustrates the intuition behind the
cut point. In the figure, since the partial FP-tree for $t(T_\alpha)$ of $t(\mathcal{D}_\alpha)$ transactions can be
accommodated in main memory, we can expect that the FP-tree containing $i_1, \ldots, i_k$, where
$k = \lfloor n \cdot t(T_\alpha)/t(\mathcal{D}_\alpha) \rfloor$, also will fit in main memory.



Figure 9: Cut Point. Here $l = t(T_\alpha)$, and $m =$

The above method works well for many databases, especially for those databases whose
corresponding FP-trees have plenty of sharing of prefixes for items from $i_1$ to the cut point.
However, if the FP-tree constructed from a database doesn't share prefixes that much, the
estimation could fail, since now the FP-tree for items from $i_1$ to the cut point could be too
big. Thus, we have to consider another method. In Table 1, $\nu[j](T_\alpha)$ records the size of the
FP-tree after the partial FP-tree $T_\alpha$ is trimmed and only contains items $i_1, \ldots, i_j$. Based
on $\nu[j](T_\alpha)$, the number of nodes in the complete FP-tree for item $i_j$ can be estimated as
$\nu[j](T_\alpha) \cdot t(\mathcal{D}_\alpha)/t(T_\alpha)$. Now, finding the cut point becomes finding the biggest $k$ such that
$\nu[k](T_\alpha) \cdot t(\mathcal{D}_\alpha)/t(T_\alpha) < \nu(T_\alpha)$, and $\nu[k+1](T_\alpha) \cdot t(\mathcal{D}_\alpha)/t(T_\alpha) > \nu(T_\alpha)$.

Sometimes the above estimation only guarantees that the main memory is big enough
for the FP-tree which contains all items between $i_1$ and the cut point, while it doesn't
guarantee that the descendant trees from that FP-tree can fit in main memory. This is because
the estimation doesn't consider the size of descendant trees correctly (in Section 3.3 we
assumed that the size of a conditional tree is 10% of its nearest ancestor tree). Actually,
from $\mu[j](T_\alpha)$ we can get a more accurate estimation of the size of the biggest descendant
tree. To find the cut point, we need to find the biggest $k$ such that
$(\nu[k](T_\alpha) + \mu[j](T_\alpha)) \cdot t(\mathcal{D}_\alpha)/t(T_\alpha) < \nu(T_\alpha)$, and
$(\nu[k+1](T_\alpha) + \mu[m](T_\alpha)) \cdot t(\mathcal{D}_\alpha)/t(T_\alpha) > \nu(T_\alpha)$, where $j \le k$,
$\mu[j](T_\alpha) = \max_{l \in \{1, \ldots, k\}} \mu[l](T_\alpha)$, and $m \le k+1$, $\mu[m](T_\alpha) = \max_{l \in \{1, \ldots, k+1\}} \mu[l](T_\alpha)$.
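
The search for the cut point under the second method can be sketched as below. The function name `find_cut_point` and the list-based encoding of $\nu[\cdot]$ and $\mu[\cdot]$ are assumptions for illustration, and the inequality is multiplied through by $t(T_\alpha)$ so that the sketch stays in integer arithmetic.

```python
def find_cut_point(nu, mu, t_db, t_tree):
    """Return the largest k with (nu[k] + max(mu[1..k])) * t_db / t_tree < nu[n].

    nu[k-1] holds nu[k](T) and mu[k-1] holds mu[k](T); returns 0 if no k qualifies.
    Pass mu as all zeros to get the simpler test that ignores descendant trees.
    """
    capacity = nu[-1]                 # nu(T) = nu[n](T): a size known to fit in memory
    cut_point, largest_mu = 0, 0
    for k in range(1, len(nu) + 1):
        largest_mu = max(largest_mu, mu[k - 1])
        if (nu[k - 1] + largest_mu) * t_db < capacity * t_tree:
            cut_point = k
        else:
            break                     # both terms are nondecreasing in k
    return cut_point


nu = [10, 30, 60, 100, 150, 200]      # hypothetical node counts nu[1..6]
mu = [0, 5, 45, 40, 50, 60]           # hypothetical conditional-tree sizes mu[1..6]
print(find_cut_point(nu, [0] * 6, t_db=1000, t_tree=500))   # ignoring descendants
print(find_cut_point(nu, mu, t_db=1000, t_tree=500))        # accounting for them
```

With these made-up statistics the simple test gives a cut point of 3, while accounting for the largest descendant tree lowers it to 2.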

Grouping the rest of the items. Now we answer the second question: how to put more
items into a group? Here we still need $\mu[j](T_\alpha)$. Starting with $\mu[\mathit{cutpoint}+1](T_\alpha)$, we test
if $\mu[\mathit{cutpoint}+1](T_\alpha) \cdot t(\mathcal{D}_\alpha)/t(T_\alpha) > \nu(T_\alpha)$. If not, we put the next item $i_{\mathit{cutpoint}+2}$ into the
group, and test if $(\mu[\mathit{cutpoint}+1](T_\alpha) + \mu[\mathit{cutpoint}+2](T_\alpha)) \cdot t(\mathcal{D}_\alpha)/t(T_\alpha) > \nu(T_\alpha)$. We
repeatedly put the next item in freqstring($\mathcal{D}_\alpha$) into the group until we reach an item $i_j$ such
that

$$\sum_{m=\mathit{cutpoint}+1}^{j} \mu[m](T_\alpha) \cdot t(\mathcal{D}_\alpha)/t(T_\alpha) > \nu(T_\alpha).$$

Then, starting from $i_j$, we put items into the next group, until every item has found its group.
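
The greedy grouping just described can be sketched as follows. The function name `group_remaining_items` and the list encoding of $\mu[\cdot]$ are hypothetical, and an item whose own estimate already exceeds the capacity is simply placed in a group of its own, which is one reading of the text rather than a confirmed detail.

```python
def group_remaining_items(mu, cut_point, t_db, t_tree, capacity):
    """Group item indices cut_point+1 .. n greedily by estimated FP-tree size.

    mu[j-1] holds mu[j](T). A group is closed just before the item that would
    push sum(mu) * t_db / t_tree above capacity; that item starts the next group.
    """
    groups, current, total = [], [], 0
    for j in range(cut_point + 1, len(mu) + 1):
        size = mu[j - 1]
        # Integer form of (total + size) * t_db / t_tree > capacity.
        if current and (total + size) * t_db > capacity * t_tree:
            groups.append(current)
            current, total = [], 0
        current.append(j)
        total += size
    if current:
        groups.append(current)
    return groups


mu = [0, 5, 45, 40, 50, 60, 30, 20]   # hypothetical mu[1..8]
print(group_remaining_items(mu, cut_point=2, t_db=1000, t_tree=500, capacity=200))
```

For these made-up values the items after the cut point fall into the groups [3, 4], [5], [6, 7] and [8].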

Why can we group items together? This is because even if we construct $T_{\alpha.i_j}, \ldots, T_{\alpha.i_k}$
from the projected databases $\mathcal{D}_{\alpha.i_j}, \ldots, \mathcal{D}_{\alpha.i_k}$ and put all of them into main memory,
the main memory is big enough according to the grouping condition. At this stage,
$T_{\alpha.i_j}, \ldots, T_{\alpha.i_k}$ can all be constructed by scanning $\mathcal{D}_\alpha$ once. Then we mine frequent itemsets
from the FP-trees. However, we can do better. Obviously $T_{\alpha.i_j}, \ldots, T_{\alpha.i_k}$ overlap a
lot, and the total size of the trees is definitely greater than the size of $T_{\alpha.\beta}$. It also means
that we can put more items into each $\beta$, as long as the size of $T_{\alpha.\beta}$ is estimated to fit in main
memory. To estimate the size of $T_{\alpha.\beta}$, part of $T_\alpha$ has to be traversed by following the links
for the master items in $T_\alpha$.

3.7 Database projection 

After all items have found their groups, the original database will be projected to small
databases according to Definition 1. To save disk I/O's, three techniques can be used:

1. In a group $\beta$, if the number of master items is greater than half of the number
of frequent items (this often happens in the group that contains the cut point), then
$\mathcal{D}_{\alpha.\beta}$ need not be computed. To mine all frequent itemsets, $T_{\alpha.\beta}$ can be directly
constructed from $\mathcal{D}_\alpha$ by reading it once. This is because $\mathcal{D}_{\alpha.\beta}$ is not much smaller
than $\mathcal{D}_\alpha$, while the disk I/O's for reading $\mathcal{D}_\alpha$ once are fewer than the disk I/O's
for writing and reading $\mathcal{D}_{\alpha.\beta}$ once.

2. Since the partial tree $T_\alpha$, now in main memory, records all frequency information of
those transactions that have been read so far, when computing projected databases,
the frequency information of those transactions can be obtained from $T_\alpha$. Thus disk
I/O's are only spent on reading those transactions that did not contribute to
$T_\alpha$.

3. As discussed in Section 3.4, by using the array technique, in group $\beta_j$, we find all
slave items such that they are not frequent with any master item in $\beta_j$, and all
master items such that their number of frequent items in $\{\beta_1, \ldots, \beta_j\}$ is 0 or 1.
When computing $\mathcal{D}_{\alpha.\beta_j}$, all those items are removed from new transactions in $\mathcal{D}_{\alpha.\beta_j}$.
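
The first technique amounts to a simple per-group decision, sketched below. The function name `plan_projection` and the action labels "direct" and "project" are hypothetical and serve only to illustrate the rule.

```python
def plan_projection(groups, num_frequent_items):
    """For each group of master items, decide whether to skip the projected database.

    "direct": build the group's FP-tree by scanning the parent database once.
    "project": write the projected database to disk and read it back later.
    """
    plan = []
    for group in groups:
        # More master items than half of all frequent items: projection saves little.
        if 2 * len(group) > num_frequent_items:
            plan.append("direct")
        else:
            plan.append("project")
    return plan


groups = [[1, 2, 3, 4, 5], [6, 7], [8]]    # hypothetical grouping of 8 frequent items
print(plan_projection(groups, num_frequent_items=8))
```

In this example only the first group, which holds five of the eight frequent items, is mined directly from the parent database.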



3.8 The disk I/O's 

Let's re-count the disk I/O's used in Diskmine. From the first scan we get all frequent
items in $\mathcal{D}_\epsilon$, which needs $D/B$ disk I/O's. In the second scan we construct a partial FP-tree
$T_\epsilon$, then continue scanning the rest of the database for statistics, which needs another $D/B$
disk I/O's. Suppose then that $k$ projected databases have to be computed. As estimated
earlier, the total size of the projected databases is approximately $k/2 \cdot D$. For computing
the projected databases, the frequency information in $T_\epsilon$ is reused, so only part of $\mathcal{D}_\epsilon$ is
read. We assume on average half of $\mathcal{D}_\epsilon$ is read at this stage, which means $1/2 \cdot D/B$ disk
I/O's. Writing and later reading $k$ projected databases will take $2 \cdot k/2 \cdot D/B = k \cdot D/B$ disk
I/O's. Suppose all frequent itemsets can be mined from the projected databases without
going to the third level. Then the total number of disk I/O's is

$3/2 \cdot D/B + k \cdot D/B$    (4)

Compared with the earlier formula, Diskmine saves at least $k/2 \cdot D/B$ disk I/O's, thanks to the
various techniques used in the algorithm.

4 Experimental Evaluation and Performance Study 

In this section, we present the results from a performance comparison of Diskmine with
the Parallel Projection Algorithm in [9] and the Partitioning Algorithm introduced in [15].
The scalability of Diskmine is also analyzed, and the accuracy of our memory size
estimations is validated.

As mentioned earlier, the Parallel Projection Algorithm is a naive divide-and-conquer
algorithm, since for each item a projected database is created. For performance
comparison, we implemented the Parallel Projection Algorithm, using FP-growth as the main
memory method. The Partitioning Algorithm is also a divide-and-conquer
algorithm. We implemented the Partitioning Algorithm by using the Apriori
implementation [2]. We chose this implementation, since it was well written and easy to adapt
for our purposes.

We ran the three algorithms on both synthetic datasets and real datasets. Some synthetic
datasets have millions of transactions, and the size of the datasets ranges from several
megabytes to several hundred gigabytes. Without loss of generality, only the results for
some synthetic datasets and a real dataset are shown here.

All experiments were performed on a 2.0 GHz Pentium 4 with 256 MB of memory under
Windows XP. For Diskmine and the Parallel Projection Algorithm, the size of the main
memory is given as an input. For the Partitioning Algorithm, since it only has two database
scans and each main-memory-sized partition and all data structures for Apriori are stored
in main memory, the size of main memory is not controlled, and only the running time
is recorded.

We first compared the performance of the three algorithms on a synthetic dataset. Dataset
T100I20D100K was generated from the application of PQ. The dataset has 100,000 transactions
and 1000 items, and occupies about 40 megabytes of memory. The average transaction
length is 100, and the average pattern length is 20. The dataset is very sparse and the FP-tree
constructed from the dataset is bushy. For Apriori, a large number of candidate frequent
itemsets will be generated from the dataset. When running the algorithms, the main memory
size was given as 128 megabytes. Figure 10 (a) shows the experimental result. In the
figure, "Naive Algorithm" represents the Parallel Projection Algorithm, and "Aggressive
Algorithm" represents the Diskmine algorithm.

Figure 10: Experiments on Synthetic Data and Real Data



From Figure 10 (a), we can see that the Partitioning Algorithm is the slowest in the
group. The Naive Algorithm, however, is not slower than the Aggressive Algorithm if we
only compare their CPU time. In [7], where we were concerned with main memory mining,
we found that if a dataset is sparse the boosted FP-growth* method has a much better
performance than the original FP-growth. The reason that the CPU time of the Aggressive
Algorithm here is not always less than that of the Naive Algorithm is that the Aggressive
Algorithm has to spend CPU time on calculating statistics. On the other hand, as expected, we can
see in the figure that the disk I/O time of the Aggressive Algorithm is orders of magnitude
smaller than that of the Naive Algorithm. In Figure 10 (b) we compare the total running
times. We can see that the CPU overhead of the Aggressive Algorithm now becomes
insignificant compared to the savings in disk I/O.

We then ran the algorithms on a real dataset Kosarak, which is used as a test dataset 
in ^H]. The dataset is about 40 megabytes. Since it is a dense dataset and its FP-tree is 
pretty small, we set the main memory size as 16 megabytes for the experiments. Results
are shown in Figure 10 (c).

In Figure 10 (c), the Partitioning Algorithm is still the slowest. This is because it
generates too many candidate frequent itemsets. Together with the data structures, these
candidate sets exhaust main memory, so virtual memory is used. We can also again notice
that the CPU time of the Naive Algorithm is less than that of the Aggressive Algorithm.
This is because Kosarak is a dense dataset, so the array technique doesn't help a lot. In
addition, calculating the statistics takes much time. The disk I/O's for the Aggressive
Algorithm are still remarkably fewer than the disk I/O's for the Naive Algorithm.

To test the effectiveness of the techniques for grouping items, we ran Diskmine on
T100I20D100K to see how close the estimation of the FP-tree size for each group is to
its real size. We again set the main memory size to 128 megabytes and the minimum support
to 2%. When generating the projected databases, items were grouped into 7 groups (the
total number of frequent items is 826). As we can see from Figure 11 (a), in all groups, the
estimated size differs only slightly from the real size. Compared with the Naive Algorithm,
which constructs an FP-tree for each item from its projected database, the Aggressive
Algorithm almost fully uses the main memory for each group to construct an FP-tree.

Figure 11: Estimation Effect and Scalability of Diskmine



As a divide-and-conquer algorithm, one of the most important properties of Diskmine
is its good scalability. We ran Diskmine on a set of synthetic datasets. In all datasets, the
number of items was set to 10000, the average transaction length to 100, and the average
pattern length to 20. The number of transactions in the datasets varied from 200,000
to 2,000,000. Dataset sizes range from 100 megabytes to 1 gigabyte. Minimum support
was set to 1.5%, and the available main memory was 128 megabytes. Figure 11 (b) shows
the results. In the figure, the CPU and the disk I/O time are always kept in a small range
of acceptable values. Even for the datasets with 2 million transactions, the total running
time is less than 1000 seconds. Extrapolating from these figures using formula (4), we can
conclude that a dataset the size of the Library of Congress collection (25 Terabytes) could
be mined in around 18 hours with current technology.

5 Conclusions 

We have introduced several divide-and-conquer algorithms for mining frequent itemsets from
secondary memory. We have analyzed the recurrences and disk I/O's of all algorithms.

We then gave a detailed divide-and-conquer algorithm which almost fully uses the
limited main memory and saves a large number of disk I/O's. We introduced many
novel techniques used in our algorithm.

Our experimental results show that our algorithm successfully reduces the number
of disk accesses, sometimes by orders of magnitude, and that our algorithm scales up to
terabytes of data. The experiments also validate that the estimation techniques used in
our algorithm are accurate.

For future work, we notice that there are very few efficient algorithms for mining maximal
frequent itemsets and closed frequent itemsets [13, 14, 17, 20] from very large databases.
Unlike in Diskmine, where the frequent itemsets mined from all projected databases are
globally frequent, a maximal frequent itemset or a closed frequent itemset mined from a
projected database is only locally maximal or closed. As a challenge, a data structure,
whose size may also be very big, must be set up for recording all already discovered maximal
or closed frequent itemsets. We also notice that our implementation of the Partitioning
Algorithm is based on an existing Apriori implementation, which is not necessarily highly
optimized. As we know, there are situations when there are not too many candidate
itemsets in a database, but the FP-tree constructed from the database is pretty big. In
this situation the Partitioning Algorithm only needs two database scans and all frequent
itemsets can be nicely mined in main memory, or with very little I/O for keeping the candidate
sets in virtual memory. In this situation Diskmine also needs two database scans, and it
additionally needs to decompose the database. Therefore, exploring whether some clever
disk-based data structure would make the partition approach scale is another interesting
direction for further research.